1 Introduction

First, we again create 1 complete dataset, which is done by means of imputing the boys dataset in mice (Van Buuren & Groothuis-Oudshoorn, 2011) with default settings, and complete the dataset. Then, based on this one true dataset, we extract the regression coefficients of a linear regression model in which weight is regressed on age and height, that are used to compare the estimates of the synthetic versions of the datasets. The true regression coefficients are equal to \(\beta_0 = -9.216\), \(\beta_{age} = 2.327\), \(\beta_{height} = 0.191\). Then, for a total of 200 iterations, we impute a matrix with the same dimensions as the boys dataset (that is, the sample size \(n = 748\) and the number of predictors \(k = 9\)). This is done once with the default settings, that is, with pmm for the continuous variables, polr for binary variables and polyreg for ordinal variables with more than two categories. Furthermore, bmi is imputed passively, since it consists of a fixed combination of height and weight bmi = (wgt / (hgt/100)^2. Furthermore, the predictor matrix is adjusted so that imputations for bmi do not flow back into predictions of hgt and wgt. Then, for every synthetic dataset, the estimates, and corresponding variance of the estimates are calculated, as well as the \(95\%~CI\) are calculated. Furthermore, an indicator is added, whether or not the confidence interval includes the true parameter estimates. Note that these confidence intervals are still based on imputations for missing data, instead of imputations for synthetic data. This is due to the fact that the calculation for the degrees of freedom for the estimates based imputing synthetic values might yield degrees of freedom that are so small that the corresponding \(95\% ~ CI\) yields boundary values of \(-\infty\) and \(\infty\).


2 Simulations


2.1 Single dataset approach


2.1.1 True results

broom::tidy(truemodel, conf.int=TRUE) %>%
  mutate(CIW = conf.high - conf.low) %>% 
  kable(digits = 3) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
term estimate std.error statistic p.value conf.low conf.high CIW
(Intercept) -9.216 2.033 -4.534 0 -13.207 -5.226 7.982
age 2.327 0.189 12.335 0 1.956 2.697 0.741
hgt 0.191 0.028 6.843 0 0.136 0.246 0.110

2.1.2 Use mice to synthesize and to pool


2.1.2.1 Pool the results of the synthetic datasets with the default mice settings.

default_fit <- syns_def %>%
  map(function(x) x %$% lm(wgt ~ age + hgt) %>% pool %>% summary) %>%
  map_dfr(function(x) {
    var      <- x[,1]
    true_est <- coefs
    est      <- x[,2]
    true_se  <- sqrt(diag(vcov(truemodel)))
    se       <- x[,3]
    df       <- x[,5]
    lower    <- est + se * qt(.025, df)
    upper    <- est + se * qt(.975, df)
    cov      <- lower < coefs & coefs < upper
  
    bind_cols(var = var, true_est = true_est, est = est, true_se = true_se, 
                               se = se, df = df, lower = lower, upper = upper, cov = cov)
    })

results_def <- default_fit %>%
  group_by(var) %>%
  summarise("True Est" = unique(true_est),
            "Syn Est"  = mean(est),
            "Bias"     = mean(est - true_est),
            "True SE"  = unique(true_se),
            "Syn SE"   = mean(se),
            "df"       = mean(df),
            "Lower"    = mean(lower),
            "Upper"    = mean(upper),
            "CIW"      = mean(upper - lower),
            "Coverage" = mean(cov))

2.1.2.2 Pool the results of the synthetic dataset with CART.

cart_fit <- syns_cart %>%
  map(function(x) x %$% lm(wgt ~ age + hgt) %>% pool %>% summary) %>%
  map_dfr(function(x) {
    var      <- x[,1]
    true_est <- coefs
    est      <- x[,2]
    true_se  <- sqrt(diag(vcov(truemodel)))
    se       <- x[,3]
    df       <- x[,5]
    lower    <- est + se * qt(.025, df)
    upper    <- est + se * qt(.975, df)
    cov      <- lower < coefs & coefs < upper

    bind_cols(var = var, true_est = true_est, est = est, true_se = true_se,
              se = se, df = df, lower = lower, upper = upper, cov = cov)
    })

 results_cart <- cart_fit %>%
   group_by(var) %>%
   summarise("True Est" = unique(true_est),
             "Syn Est"  = mean(est),
             "Bias"     = mean(est - true_est),
             "True SE"  = unique(true_se),
             "Syn SE"   = mean(se),
             "df"       = mean(df),
             "Lower"    = mean(lower),
             "Upper"    = mean(upper),
             "CIW"      = mean(upper - lower),
             "Coverage" = mean(cov))

2.1.2.3 Results for both synthesization methods

bind_rows("Mice default" = results_def,
          "Mice with cart" = results_cart, .id = "Imputation method") %>%
  kable(digits = 3) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Imputation method var True Est Syn Est Bias True SE Syn SE df Lower Upper CIW Coverage
Mice default (Intercept) -9.216 -2.181 7.035 2.033 2.868 30.450 -8.469 4.107 12.576 0.315
Mice default age 2.327 2.941 0.615 0.189 0.275 18.919 2.327 3.556 1.229 0.420
Mice default hgt 0.191 0.095 -0.096 0.028 0.040 22.509 0.007 0.183 0.176 0.355
Mice with cart (Intercept) -9.216 -7.690 1.526 2.033 2.712 45.199 -13.438 -1.943 11.495 1.000
Mice with cart age 2.327 2.453 0.127 0.189 0.261 37.147 1.892 3.014 1.122 1.000
Mice with cart hgt 0.191 0.171 -0.020 0.028 0.038 38.696 0.089 0.252 0.164 1.000

2.1.3 Use mice to synthesize with pooling rules from Drechsler

For now, we forget about the default mice imputation method, because it leads to flawed results. Therefore, we continue with the CART results only. We fit the lm model in which wgt is regressed on age and hgt.

## pool.lm.syn

## If you have a multiply imputed synthetic dataset, that is, multiple fully imputed 
## datasets that are generated by mice, using an all-ones matrix as the where matrix, 
## this function can be used to analyse the synthetic data by means of a linear 
## regression model, and to pool the results.

pool.syn <- function(mira) {
  
  fitlist <- mira %$% analyses
  
  m <- length(fitlist)
  
  pooled <- fitlist %>% 
    map_dfr(broom::tidy) %>%
    group_by(term) %>%
    summarise(est     = mean(estimate),
              bm      = sum((estimate - est)^2) / (m - 1),
              ubar    = mean(std.error^2),
              var_u   = (1 + 1/m) * bm - ubar,
              var     = if_else(var_u > 0, var_u, ubar),
              df      = max(1, (m - 1) * (1 - ubar / ((1 + 1/m) * bm))^2),
              lower   = est - qt(.975, df) * sqrt(var),
              upper   = est + qt(.975, df) * sqrt(var), .groups = 'drop')
  pooled
}

ci_cov <- function(pooled, true_fit) {
  
  coefs <- coef(true_fit)
  vars   <- diag(vcov(true_fit))
  
  nsim <- nrow(pooled) / length(coefs)
  
  pooled %>% mutate(true_coef = rep(coefs, nsim),
                    true_var  = rep(vars, nsim),
                    cover     = lower < true_coef & true_coef < upper) %>%
    group_by(term) %>%
    summarise("True Est" = unique(true_coef),
              "Syn Est"  = mean(est),
              "Bias"     = mean(est - true_coef),
              "True SE"  = unique(sqrt(true_var)),
              "Syn SE"   = mean(sqrt(var)),
              "df"       = mean(df),
              "Lower"    = mean(lower),
              "Upper"    = mean(upper),
              "CIW"      = mean(upper - lower),
              "Coverage" = mean(cover))
}
models <- syns_cart %>% map(function(x) x %$% lm(wgt ~ age + hgt))

Then, we use the custom pooling function to pool the estimates.

pooled <- models %>% map_dfr(pool.syn)

And another custom function to summarise all results, and obtain the confidence interval coverage.

ci_cov(pooled, truemodel) %>%
  kable(digits = 3) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
term True Est Syn Est Bias True SE Syn SE df Lower Upper CIW Coverage
(Intercept) -9.216 -7.690 1.526 2.033 1.740 24.983 -20.820 5.439 26.259 0.995
age 2.327 2.453 0.127 0.189 0.165 20.236 1.169 3.737 2.568 1.000
hgt 0.191 0.171 -0.020 0.028 0.024 18.746 -0.019 0.360 0.379 0.995

It can be seen that the coverage of the confidence interval is nearly equal to one. Since we currently only want to make inferences with regard to the sample, this is okay. Namely, if the CI coverage with regard to the sample would be \(95\%\), the CI coverage with regard to the population would be \(0.9025\%\). Inferences with regard to the population will be discussed in the next paragraph.

However, what is questionable, is the fact that the confidence interval width has increased, even though the variance has decreased. Since the confidence interval width is a function of only the standard error and the degrees of freedom, this must be due to the fact that the degrees of freedom are smaller than the degrees of freedom as returned by the mice pooling function.

mice_df <- data.frame(term = cart_fit$var, df = cart_fit$df)
syn_df <- data.frame(term = pooled$term, df = pooled$df)

df <- bind_rows("Mice" = mice_df, "Synthetic" = syn_df, .id = "Method")

ggplot(data = df, aes(x = df, color = Method, fill = Method)) +
  geom_density(alpha = .7) +
  facet_wrap(~ term, nrow = 1, scales = "free") +
  scale_color_viridis_d(begin = .30, end = .65) +
  scale_fill_viridis_d(begin = .30, end = .65) +
  theme_classic()


2.1.4 Different synthesizing sequence

Now we have established that the mice synthesizing algorithm works as good as the synthpop synthesizing algorithm, we can check whether changing the order in which hgt and wgt are synthesized influences the result of the algorithm.

cart_wgt_hgt %>% 
  map(function(x) x %$% lm(wgt ~ age + hgt)) %>%
  map_dfr(pool.syn) %>%
  ci_cov(truemodel) %>%
  kable(digits = 3) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
term True Est Syn Est Bias True SE Syn SE df Lower Upper CIW Coverage
(Intercept) -9.216 -7.543 1.674 2.033 1.759 38.478 -20.585 5.500 26.084 1.000
age 2.327 2.470 0.143 0.189 0.167 75.614 1.065 3.875 2.810 0.995
hgt 0.191 0.168 -0.023 0.028 0.024 35.278 -0.018 0.355 0.374 1.000

2.2 Bootstrapped boys data

synthetic_results <- boot_cart %>%
  map(function(x) x %$% lm(wgt ~ age + hgt)) %>%
  map_dfr(pool.syn) %>%
  ci_cov(truemodel) %>%
  kable(digits = 3) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

norm_results <- boot_cart %>%
  map(function(x) x %$% lm(wgt ~ age + hgt)) %>%
  map_dfr(pool.syn) %>%
  mutate(df = NA,
         lower = est - qnorm(.975) * sqrt(var),
         upper = est + qnorm(.975) * sqrt(var)) %>%
  ci_cov(truemodel) %>%
  kable(digits = 3, caption = "Normal CI approximation") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

true_results <- bootstrap_boys %>%
  map(~ lm(wgt ~ age + hgt, .x)) %>%
  map_dfr(~ broom::tidy(.x, conf.int = TRUE)) %>%
  mutate(true_coef = rep(coef(truemodel), nsim),
         true_se   = rep(sqrt(diag(vcov(truemodel))), nsim),
         cover     = conf.low < true_coef & true_coef < conf.high) %>%
  group_by(term) %>%
  summarise("True Est" = unique(true_coef),
            "Boot Est"  = mean(estimate),
            "Bias"     = mean(estimate - true_coef),
            "True SE"  = unique(true_se),
            "Boot SE"   = mean(std.error),
            "DF"       = 745,
            "Lower"    = mean(conf.low),
            "Upper"    = mean(conf.high),
            "CIW"      = mean(conf.high - conf.low),
            "Coverage" = mean(cover)) %>%
  kable(digits = 3) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
term True Est Boot Est Bias True SE Boot SE DF Lower Upper CIW Coverage
(Intercept) -9.216 -9.200 0.016 2.033 2.023 745 -13.171 -5.229 7.942 0.940
age 2.327 2.325 -0.002 0.189 0.188 745 1.957 2.693 0.737 0.930
hgt 0.191 0.191 0.000 0.028 0.028 745 0.136 0.245 0.109 0.925
term True Est Syn Est Bias True SE Syn SE df Lower Upper CIW Coverage
(Intercept) -9.216 -8.026 1.190 2.033 1.776 95.710 -19.444 3.392 22.836 0.97
age 2.327 2.428 0.101 0.189 0.160 87.524 1.316 3.539 2.223 0.97
hgt 0.191 0.175 -0.016 0.028 0.023 81.177 0.016 0.333 0.317 0.98
Normal CI approximation
term True Est Syn Est Bias True SE Syn SE df Lower Upper CIW Coverage
(Intercept) -9.216 -8.026 1.190 2.033 1.776 NA -11.507 -4.545 6.961 0.855
age 2.327 2.428 0.101 0.189 0.160 NA 2.114 2.741 0.627 0.815
hgt 0.191 0.175 -0.016 0.028 0.023 NA 0.129 0.221 0.092 0.790

References

Van Buuren, S., & Groothuis-Oudshoorn, K. (2011). mice: Multivariate imputation by chained equations in r. Journal of Statistical Software, 45(3), 1–67. Retrieved from https://www.jstatsoft.org/v45/i03/